Abstract

Scoring functions are used to evaluate and compare partially
probabilistic forecasts. We investigate the use of rank-sum functions such
as the empirical Area Under the Curve (AUC), a widely-used measure of
classification performance, as a scoring function for the prediction of
probabilities of a set of binary outcomes. It is shown that the AUC is
not generally a proper scoring function, that is, under certain
circumstances it is possible to improve on the expected AUC by modifying
the quoted probabilities from their true values. However, with some
restrictions, or with certain modifications, it can be made proper.

Keywords: scoring rules, scoring functions, area under the curve. 


1 Introduction 

Predicting the outcomes of multiple binary variables is a common problem 
across a variety of application domains, such as fraud detection, credit risk 
evaluation, medical diagnostics and weather forecasting. Such forecasts
typically carry some information describing the uncertainty of the forecaster,
such as assigning explicit probabilities or some other numerical value to each 
variable that allows the variables to be ranked in order of relative probability 
of occurrence. 

This paper investigates numerical measures for evaluating and comparing
the accuracy of such forecasts. Although such measures have always been
important for comparing algorithms, their role has become increasingly
important with the popularity of prediction competitions, where it is necessary
to precisely quantify the performance of participants. In particular, we use
the framework of scoring functions, which map the prediction and subsequent
observation to a single real number, the score, representing the reward
to the forecaster. The aim of the forecaster is then to maximise this reward.

Scoring functions can be viewed as extensions of scoring rules (section 2.1),
which require that the forecast be fully probabilistic, providing a full joint
probability distribution over the set of all possible outcomes, which can be
infeasible and unnecessary in many situations. Scoring functions (section 2.2),
on the other hand, can make use of partial probabilistic information such as
marginal distributions, or rankings of expected values. One desirable feature
of both scoring rules and scoring functions is that they be proper: that the
forecaster always has the incentive to be honest, in that the forecast which
maximises their expected score matches their true belief.

The focus of this paper is on a class of scoring functions termed rank-sum
functions (section 3), the most well-known of which is the area under the
curve (AUC), the curve in question being the receiver operating characteristic
(ROC). The ROC and AUC describe the usefulness of the forecast in terms of
its ability to discriminate between positive and negative outcomes. Note that
this paper specifically focuses on the empirical AUC, and not the theoretical
quantity that is perhaps more often studied: this distinction is explained in
detail in section 3.1.

The main results (section 3.2) identify sufficient conditions for rank-sum
scoring functions to be proper for evaluating the accuracy of forecasts of the 
marginal probabilities of a sequence of binary forecasts. In general, the AUC 
is not of this class, and a counter-example is provided which demonstrates a 
case in which the AUC is not a proper scoring function, in that there exist 
distributions under which the forecaster might improve their expected score 
by quoting probabilities different than their true belief. 

This framework can be further extended to the case where instead of making
a direct prediction, the forecaster is required to provide a mapping that
indirectly makes predictions from an as-yet unobserved covariate (section 4).
In section 5 we discuss some open questions, and problems with extending
the framework to a sequential setting.

2 Scoring of forecasts for binary outcomes


2.1 Scoring rules 

Consider the setting where one is eliciting forecasts about some future
outcome $Y$ that takes values in an outcome space $\mathcal{Y}$. A
probabilistic forecast is a distribution $Q$ for $Y$ that describes the
forecaster's uncertainty about $Y$. We define $\mathcal{P}$ to be a family of
distributions over $\mathcal{Y}$ that are under consideration.

After the actual outcome $Y = y$ is observed, the reward to the forecaster
is determined by a scoring rule, a function
$S : \mathcal{Y} \times \mathcal{P} \to \mathbb{R}$, that maps the
quoted $Q$ and observed outcome $y$ to a real number $S(y, Q)$ termed the score.
We take scoring rules to be positively oriented, that is, the score represents
the reward to the forecaster, who therefore aims to maximise this quantity.
In a decision theoretic context, the negation of the score can be considered a
loss function. Mathematically, the problem can be precisely phrased in the
form of a game between a Forecaster and Nature (Dawid et al., 2012).


For any $P \in \mathcal{P}$, we can then define the expected score as
$E_P[S(Y, Q)]$, where $Y$ is generated from $P$. A scoring rule $S$ is proper
if an optimal strategy for the forecaster is to quote a distribution that
matches their actual uncertainty, that is, if for all $Q, P \in \mathcal{P}$,

$$E_P[S(Y, Q)] \le E_P[S(Y, P)]. \tag{1}$$

Additionally, $S$ is termed strictly proper if this is the only optimal strategy,
i.e. (1) is an equality only if $Q = P$. Proper scoring rules for discrete variables
have been extensively studied (e.g. Dawid et al., 2012): common examples
include the Brier, spherical and the log scores.

In this paper, we will consider the outcome space to be a vector of binary
variables,

$$Y = (Y_1, \ldots, Y_n) \in \mathcal{Y} = \{0, 1\}^n.$$

In this case, the distribution $Q$ takes values on $\Delta_{2^n - 1}$, the
$(2^n - 1)$-dimensional unit simplex. If the family $\mathcal{P}$ is the set of
all such distributions, then for large values of $n$ this can place a large
burden in terms of time and resources in constructing, communicating and
evaluating the score of the forecast. This motivates a more flexible framework.


2.2 Scoring functions 

Suppose that instead of supplying a distribution $Q$ from a family
$\mathcal{P}$, we require the forecaster to quote a forecast from an arbitrary
set $\mathcal{Z}$, which we will term the prediction space. Then a scoring
function is a mapping of the form
$s : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$. Gneiting (2011)
extensively studied scoring functions in the context of point forecasts,
where $\mathcal{Z} = \mathcal{Y}$, though as we shall demonstrate,
the concept extends directly to a more general context.

The price of this generality is that we now need to explicitly specify
the aspects of the forecaster's uncertainty that we want to capture. This
can be described by a (statistical) functional, a possibly set-valued function,
$T : \mathcal{P} \to \mathcal{Z}$ or $T : \mathcal{P} \to \wp\mathcal{Z}$,
where $\wp\mathcal{Z}$ denotes the power set of $\mathcal{Z}$.


A scoring function $s$ is then said to be $T$-proper (Gneiting (2011) uses
the term consistent) if for all $P \in \mathcal{P}$, and all $u \in \mathcal{Z}$,

$$E_P[s(Y, u)] \le E_P[s(Y, T(P))] \tag{2}$$

for a $\mathcal{Z}$-valued functional $T$, or for a set-valued functional $T$,

$$E_P[s(Y, u)] \le E_P[s(Y, t)] \quad \text{for all } t \in T(P). \tag{3}$$

Furthermore, we can define $s$ to be strictly $T$-proper if equality holds only if
$u = T(P)$ or $u \in T(P)$, respectively. Note that the condition in (3) implies
that for any proper scoring function $s$ of a set-valued functional, the expected
score $E_P[s(Y, t)]$ must be constant for all $t \in T(P)$. As would be expected
from the terminology, there is a strong link between scoring functions and
scoring rules, in that a (strictly) proper scoring function defines a (strictly)
proper scoring rule (Gneiting, 2011, Theorem 3).

In this paper, we focus on two specific classes of functionals for
distributions on $\mathcal{Y} = \{0, 1\}^n$.


2.2.1 Marginal scoring 

Definition 1 The marginal functional $M$ maps a joint distribution to the
marginal probabilities of each element of $Y$,

$$M(P) = E_P[Y] = (P[Y_1 = 1], \ldots, P[Y_n = 1]).$$

This functional reduces the $(2^n - 1)$-dimensional distribution space to the
$n$-dimensional prediction space $\mathcal{Z} = [0, 1]^n$.

We can easily construct scoring functions for the marginal functional as 
functions of scoring rules for the individual elements of Y. 

Theorem 1 Let $S_i : \{0, 1\} \times [0, 1] \to \mathbb{R}$ be a scoring rule
for a single binary outcome, such as the logarithmic, quadratic or Brier score.
Then the scoring function

$$s(y, m) = \sum_{i=1}^{n} S_i(y_i, m_i)$$

is (strictly) $M$-proper if each of the $S_i$ are (strictly) proper.

Proof Each $S_i$ can be maximised independently by choosing $m_i = E[Y_i]$. □
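Theorem 1 can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes a hypothetical dependent joint distribution on $\{0,1\}^2$ and a positively oriented Brier score, and checks on a coarse grid that the expected summed score is maximised at the marginal probabilities. The names `brier`, `expected_sum_score` and `marginals` are illustrative.

```python
import itertools

def brier(y, m):
    """Positively oriented Brier (quadratic) score for one binary outcome."""
    return -(y - m) ** 2

def expected_sum_score(joint, m, rule=brier):
    """E_P[sum_i S_i(Y_i, m_i)] for a joint pmf given as {outcome tuple: prob}."""
    return sum(p * sum(rule(yi, mi) for yi, mi in zip(y, m))
               for y, p in joint.items())

def marginals(joint):
    """The marginal functional M(P) = (P[Y_1 = 1], ..., P[Y_n = 1])."""
    n = len(next(iter(joint)))
    return tuple(sum(p for y, p in joint.items() if y[i] == 1)
                 for i in range(n))

# Hypothetical joint distribution with dependent components (assumption).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
m_true = marginals(joint)

# Search a grid of quoted marginals for the best expected score.
grid = [k / 20 for k in range(21)]
best = max(itertools.product(grid, repeat=2),
           key=lambda m: expected_sum_score(joint, m))
```

Although the components of $Y$ are dependent here, the maximiser depends only on the marginals, which is the content of the proof.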

2.2.2 Rank scoring 

Recall that a total preorder is a transitive and reflexive relation $\preceq$
such that for any pair $i, j$, at least one of $i \preceq j$ or $j \preceq i$
holds. Given such a $\preceq$, we can define $i \sim j$ as the symmetric relation
$i \preceq j$ and $i \succeq j$, and $i \prec j$ as the asymmetric relation
$i \not\succeq j$ (which, due to totality, implies $i \preceq j$). Note that
$\preceq$ also implies a total ordering of the equivalence classes under $\sim$.

Define $\Sigma_n$ to be the set of total preorders on the set of indices
$I = \{1, \ldots, n\}$; then any vector $v \in \mathbb{R}^n$ induces an element
$\preceq_v$ of $\Sigma_n$ by

$$i \preceq_v j \iff v_i \le v_j.$$

Definition 2 The exact rank functional $R : \mathcal{P} \to \Sigma_n$ maps a
joint distribution to the total preorder induced by the marginal functional $M$.

The exact rank functional can also be characterised in terms of pairwise 
comparisons. 

Proposition 1 Let $\preceq = R(P)$ for some distribution $P$ on $\mathcal{Y}$. Then

$$i \preceq j \iff P[Y_i > Y_j] \le P[Y_i < Y_j].$$

Proof By adding $P[Y_i = 1, Y_j = 1]$ to both sides, we have that

$$P[Y_i = 1, Y_j = 0] \le P[Y_i = 0, Y_j = 1] \iff P[Y_i = 1] \le P[Y_j = 1]. \quad □$$

In the case where all the elements of $M(P)$ are unique, $R(P)$ is a total order.
We define $\Omega_n \subset \Sigma_n$ to be the set of all total orders on $I$.

Note that the exact rank functional requires that ties ($E[Y_i] = E[Y_j]$) be
identified exactly. We define a weaker notion under which the ties can be
ignored. A relation $\preceq'$ is contained in a relation $\preceq$ if
$\preceq' \subseteq \preceq$, that is, if $i \preceq' j$ implies that
$i \preceq j$.

Definition 3 The weak rank functional $R^* : \mathcal{P} \to \wp\Sigma_n$ is the
set-valued functional that maps a probability distribution to the set of total
preorders contained in the exact rank functional:

$$R^*(P) = \{\preceq \in \Sigma_n : \preceq \subseteq R(P)\}.$$

As a result, if all elements of $M(P)$ are unique, then $R^*(P) = \{R(P)\}$, and
conversely if all the elements of $M(P)$ are equal, then $R^*(P) = \Sigma_n$.

Given an $R^*$-proper scoring function $s$, we can construct an $M$-proper
scoring function $s'$, via $s'(y, m) = s(y, \preceq_m)$. Of course, such a
scoring function can never be strictly $M$-proper, as $\preceq_m$ is preserved
under any monotonic increasing transformation.

An advantage of rank-based scoring functions is that they allow the use 
of more abstract measures of propensity other than probability, and make 
it possible to compare forecasts generated by a wide variety of algorithms, 
whose outputs need not necessarily have a direct probabilistic interpretation. 
The downside is that we lose the ability to say anything about the calibration 
of the forecaster. 


3 Rank-sum scoring functions 

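We now consider a particular class of rank-based scoring functions. For any
total preorder $\preceq$ we define its rank vector
$\rho : \Sigma_n \to \mathbb{Z}^n$ to be the net number of elements that precede
each element,

$$\rho_i(\preceq) = \sum_{j=1}^{n} \left( 1_{j \preceq i} - 1_{i \preceq j} \right).$$

We will consider the class of rank-sum scoring functions, of the form

$$s(y, \preceq) = g(y) + \sum_{i=1}^{n} \sigma_i(y) \rho_i(\preceq) \tag{4}$$

for some functions $g$ and $\sigma = (\sigma_1, \ldots, \sigma_n)$.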

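Example 1 (Wilcoxon-Mann-Whitney u) The most well-known example of
such a function is the Wilcoxon-Mann-Whitney $u$, commonly used as a
non-parametric test statistic for comparing the magnitude of two random
variables. It is defined as the number of times observations where $y_i = 0$
precede observations where $y_j = 1$, with ties counting as half:

$$u(y, \preceq) = \sum_{i : y_i = 0} \sum_{j : y_j = 1} \left[ 1_{i \prec j} + \tfrac{1}{2} 1_{i \sim j} \right]. \tag{5}$$

The term inside the summation is equal to
$\tfrac{1}{2}[1 + 1_{i \preceq j} - 1_{j \preceq i}]$, and so

$$u(y, \preceq) = \tfrac{1}{2} n_0(y) n_1(y) + \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (1 - y_i) y_j \left[ 1_{i \preceq j} - 1_{j \preceq i} \right],$$

where $n_1(y) = \sum_i y_i$ and $n_0(y) = n - n_1(y)$. By symmetry, we have that
$\sum_{i,j} y_i y_j [1_{i \preceq j} - 1_{j \preceq i}] = 0$, and hence,

$$u(y, \preceq) = \tfrac{1}{2} n_0(y) n_1(y) + \tfrac{1}{2} \sum_{i=1}^{n} y_i \rho_i(\preceq).$$

For a fixed $y$, $u$ will take values on the half-integers
$0, \tfrac{1}{2}, 1, \ldots, n_0(y) n_1(y)$.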
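The equivalence between the pairwise definition (5) and the rank-sum form can be checked by brute force. The sketch below is illustrative only: it represents a total preorder by a vector of values `v` that induces it (an assumption of convenience), and the function names are hypothetical.

```python
import itertools

def rank_vector(v):
    """rho_i = #{j : v_j <= v_i} - #{j : v_i <= v_j} for the preorder induced by v."""
    n = len(v)
    return [sum((v[j] <= v[i]) - (v[i] <= v[j]) for j in range(n))
            for i in range(n)]

def wmw_u_pairs(y, v):
    """Definition (5): count (negative, positive) pairs, ties counting one half."""
    n = len(y)
    return sum((v[i] < v[j]) + 0.5 * (v[i] == v[j])
               for i in range(n) if y[i] == 0
               for j in range(n) if y[j] == 1)

def wmw_u_ranksum(y, v):
    """Rank-sum form: n0 * n1 / 2 + (1/2) * sum_i y_i * rho_i."""
    n1 = sum(y)
    n0 = len(y) - n1
    rho = rank_vector(v)
    return 0.5 * n0 * n1 + 0.5 * sum(yi * ri for yi, ri in zip(y, rho))

def check_identity(n, levels=3):
    """Compare both forms over every outcome vector and every tie pattern."""
    return all(wmw_u_pairs(y, v) == wmw_u_ranksum(y, v)
               for y in itertools.product((0, 1), repeat=n)
               for v in itertools.product(range(levels), repeat=n))
```

Calling `check_identity(4)` compares the two forms for every outcome vector of length 4, including the degenerate all-zero and all-one cases.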

Example 2 (Area under the curve) The receiver operating characteristic (ROC)
describes the trade-off of sensitivity and specificity (or type I and type II
error) of a preorder, and is calculated by plotting the true positive rate against
the false positive rate that would be obtained by taking different elements of
the preorder as the cutoff.

It can be described as the parametric curve on $[0, 1] \times [0, 1]$, starting
at $(1, 1)$, then linearly connecting the points

$$\left( \frac{1}{n_0(y)} \sum_{j : y_j = 0} 1_{i \prec j}, \; \frac{1}{n_1(y)} \sum_{j : y_j = 1} 1_{i \prec j} \right) \tag{6}$$

for each equivalence class $i$ under $\sim$, in the order of $\prec$.

The area under the curve (AUC) is then the total area under this curve,
which will take values on $[0, 1]$. It is well-established (e.g. Hanley and
McNeil, 1982) that this is in fact equal to the Wilcoxon-Mann-Whitney $u$,
standardised by dividing by $n_0(y) n_1(y)$.

Note that if the outcomes are identical (i.e. $y = 0$ or $1$), then the ROC
and AUC are not properly defined. For convenience, we can define the AUC
to be 1/2 in both these cases; however, the choice of this constant does not
affect any of the results other than Theorem 2.
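As a result, we can write

$$\mathrm{AUC}(y, \preceq) = \tfrac{1}{2} + \tfrac{1}{2} \sum_{i=1}^{n} \sigma_i(y) \rho_i(\preceq),
\quad \text{where } \sigma_i(y) =
\begin{cases}
\dfrac{y_i}{n_0(y) n_1(y)} & n_1(y) \ne 0, n, \\
0 & \text{otherwise.}
\end{cases}$$

Also related is the Gini coefficient,
$g(y, \preceq) = 2\,\mathrm{AUC}(y, \preceq) - 1$, which is twice
the net area of the ROC above the diagonal, and takes values on $[-1, 1]$.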
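As a small illustration of the rank-sum representation, the sketch below computes the empirical AUC and Gini coefficient from a rank vector, taking $\sigma_i(y) = y_i / (n_0(y) n_1(y))$ and the convention AUC = 1/2 for identical outcomes. The preorder is represented by a vector of values `v` inducing it; this representation and the function names are assumptions made for the example.

```python
def rank_vector(v):
    """rho_i = #{j : v_j <= v_i} - #{j : v_i <= v_j} for the preorder induced by v."""
    n = len(v)
    return [sum((v[j] <= v[i]) - (v[i] <= v[j]) for j in range(n))
            for i in range(n)]

def auc(y, v):
    """Empirical AUC in rank-sum form, 1/2 + (1/2) * sum_i sigma_i(y) * rho_i."""
    n1 = sum(y)
    n0 = len(y) - n1
    if n0 == 0 or n1 == 0:
        return 0.5  # convention for identical outcomes
    rho = rank_vector(v)
    return 0.5 + 0.5 * sum(yi * ri for yi, ri in zip(y, rho)) / (n0 * n1)

def gini(y, v):
    """Gini coefficient, 2 * AUC - 1."""
    return 2.0 * auc(y, v) - 1.0
```

For instance, with outcomes `[0, 0, 1, 1]` and values `[0.1, 0.4, 0.35, 0.8]`, three of the four (negative, positive) pairs are correctly ordered, so `auc` returns 0.75 and `gini` returns 0.5.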


3.1 Relation to theoretical AUC 


Although the AUC has been widely explored in the literature, much of
this work (e.g. Agarwal et al., 2005; Clémençon et al., 2008; Hand, 2009;
Flach et al., 2011) focuses on a related but distinct quantity, which we will
term the theoretical AUC.

Let $\theta$ be a joint distribution for a random pair $(X_i, Y_i)$, where
$X_i$, taking values in some set $\mathcal{X}_\bullet$, is termed the covariate
or feature, and $Y_i$ is a single binary response. For some mapping
$f : \mathcal{X}_\bullet \to \mathbb{R}$, we define the conditional
CDFs $F_y(z) = \theta[f(X_i) \le z \mid Y_i = y]$. Then the theoretical ROC
replaces the empirical quantities of (6) with their theoretical equivalents,

$$(1 - F_0(z), \; 1 - F_1(z)), \quad z \in \mathbb{R},$$

which again describes a curve over $[0, 1] \times [0, 1]$. Similarly, the
theoretical AUC, denoted $\mathrm{tAUC}(\theta, f)$, is the area under this curve.

The theoretical AUC can be rewritten as the conditional expectation (e.g.
Clémençon et al., 2008, Proposition B.2),

$$\mathrm{tAUC}(\theta, f) = E\left[ 1_{f(X_1) > f(X_2)} + \tfrac{1}{2} 1_{f(X_1) = f(X_2)} \;\middle|\; Y_1 = 1, Y_2 = 0 \right], \tag{7}$$

where the expectation is with respect to the product measure
$\theta \times \theta$ for $[(X_1, Y_1), (X_2, Y_2)]$.

The relationship between the empirical and theoretical AUCs is
well-established, though for completeness we clarify the usual presentation
(e.g. Agarwal et al., 2005, Lemma 2).

Theorem 2 Let the pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent and
identically distributed as $\theta$, then the expected empirical AUC is

$$E[\mathrm{AUC}(Y, \preceq_{f(X)})] = (1 - \pi_0^n - \pi_1^n)\, \mathrm{tAUC}(\theta, f) + \tfrac{1}{2} (\pi_0^n + \pi_1^n),$$

where $\pi_c = \theta(Y_i = c)$.

Proof For any vector $y \ne 0, 1$, the expectation of (5) conditional on $Y = y$
gives an expression of the form of (7), and hence
$E[\mathrm{AUC}(Y, \preceq_{f(X)}) \mid Y = y] = \mathrm{tAUC}(\theta, f)$. □
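Theorem 2 can be verified exactly on a small discrete model by enumerating every i.i.d. sample. The sketch below assumes a hypothetical $\theta$ on covariate values $\{0, 1, 2\}$ and an identity mapping $f$; neither is taken from the text, and the function names are illustrative.

```python
import itertools

def auc(y, v):
    """Empirical AUC of values v against binary outcomes y (1/2 if degenerate)."""
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def theoretical_auc(theta, f):
    """Equation (7): P[f(X1) > f(X2)] + P[tie]/2 given Y1 = 1, Y2 = 0."""
    pos = [(x, p) for (x, y), p in theta.items() if y == 1]
    neg = [(x, p) for (x, y), p in theta.items() if y == 0]
    pi1 = sum(p for _, p in pos)
    pi0 = sum(p for _, p in neg)
    total = sum(p1 * p0 * ((f(x1) > f(x0)) + 0.5 * (f(x1) == f(x0)))
                for x1, p1 in pos for x0, p0 in neg)
    return total / (pi1 * pi0)

def expected_empirical_auc(theta, f, n):
    """Exact E[AUC] over all samples of n i.i.d. pairs drawn from theta."""
    total = 0.0
    for sample in itertools.product(theta.items(), repeat=n):
        prob = 1.0
        for _, p in sample:
            prob *= p
        xs = [x for (x, _), _ in sample]
        ys = [y for (_, y), _ in sample]
        total += prob * auc(ys, [f(x) for x in xs])
    return total

def theorem2_rhs(theta, f, n):
    """Right-hand side of Theorem 2."""
    pi1 = sum(p for (_, y), p in theta.items() if y == 1)
    pi0 = 1.0 - pi1
    mix = pi0 ** n + pi1 ** n
    return (1.0 - mix) * theoretical_auc(theta, f) + 0.5 * mix

# Hypothetical joint pmf over (x, y) pairs (assumption for illustration).
theta = {(0, 0): 0.30, (1, 0): 0.20, (2, 0): 0.10,
         (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.20}

def identity(x):
    return x
```

Comparing `expected_empirical_auc(theta, identity, 4)` with `theorem2_rhs(theta, identity, 4)` shows the two sides agreeing up to floating-point error.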

We emphasise several key differences between the empirical and theoretical
AUC. Firstly, the theoretical AUC is a function of the mapping $f$ from
$X_i$ that is used to induce a ranking on $Y_i$ (confusingly, this is itself
referred to as a “scoring function” in the literature).

Another distinction is that the distribution $\theta$ is now a hypothetical
sampling model for a single pair $(X_i, Y_i)$, whereas the previous distribution
$P$ describes the forecaster's uncertainty for a set $(Y_1, \ldots, Y_n)$. We
emphasise that these are distinct concepts: whereas the i.i.d. assumption is
typically reasonable in a sampling context, it is extremely unrealistic for
describing uncertainty, in that it would imply that there is absolutely no
information to be gained about $Y_n$ from the other $Y_1, \ldots, Y_{n-1}$.

Additionally, although the negation of $\mathrm{tAUC}(\theta, f)$ can still be
interpreted as a loss function in the standard decision-theoretic sense (e.g.
for deriving minimax procedures), $\mathrm{tAUC}(\theta, f)$ cannot be used as
a scoring function as $\theta$ is typically never observed directly.


3.2 Proper rank-sum scoring functions 

To determine the propriety of such scoring functions, we utilise the following 
key lemma. 

Lemma 1 For any fixed vector $v \in \mathbb{R}^n$, the quantity

$$\sum_{i=1}^{n} v_i \rho_i(\preceq) \tag{8}$$

is maximised over $\preceq \in \Sigma_n$ if and only if $\preceq$ is contained in
the preorder $\preceq_v$ induced by $v$.


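Proof Firstly, note that if we were to consider only total orders
$\preceq \in \Omega_n$, then the statement is a direct result of the
rearrangement inequality. For any total preorder $\preceq \in \Sigma_n$, define
$A(\preceq)$ to be the set of total orders contained in $\preceq$, that is
$A(\preceq) = \{\preceq' \in \Omega_n : \preceq' \subseteq \preceq\}$. Then for
any $i, j$, by symmetry we have that

$$\frac{1}{|A(\preceq)|} \sum_{\preceq' \in A(\preceq)} \left( 1_{j \preceq' i} - 1_{i \preceq' j} \right) = 1_{j \preceq i} - 1_{i \preceq j}.$$

Therefore $\rho(\preceq)$ is the average of all $\rho(\preceq')$ for
$\preceq' \in A(\preceq)$. It follows then that (8) is maximised if and only if
all such $\preceq'$ are themselves contained in $\preceq_v$, which in turn
implies that $\preceq$ itself is contained in $\preceq_v$. □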
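For small $n$, Lemma 1 can be checked by enumerating every total preorder. The following sketch is only a sanity check under stated assumptions: preorders are generated from integer vectors `w` in $\{0, \ldots, n-1\}^n$ (each total preorder on $n$ indices is induced by at least one such vector), and `v` is taken to be integer-valued so that comparisons are exact. The helper names are hypothetical.

```python
import itertools

def rank_vector(w):
    """rho_i = #{j : w_j <= w_i} - #{j : w_i <= w_j} for the preorder induced by w."""
    n = len(w)
    return [sum((w[j] <= w[i]) - (w[i] <= w[j]) for j in range(n))
            for i in range(n)]

def contained(w, v):
    """True if the preorder induced by w is contained in the one induced by v."""
    n = len(v)
    return all(v[i] <= v[j]
               for i in range(n) for j in range(n) if w[i] <= w[j])

def check_lemma(v):
    """Maximisers of sum_i v_i * rho_i coincide with preorders contained in <=_v."""
    n = len(v)
    candidates = list(itertools.product(range(n), repeat=n))

    def value(w):
        return sum(vi * ri for vi, ri in zip(v, rank_vector(w)))

    best = max(value(w) for w in candidates)
    return all((value(w) == best) == contained(w, v) for w in candidates)
```

For example, `check_lemma((3, 1, 1, 2))` covers a vector with one tie, and `check_lemma((0, 0, 0))` covers the case where every preorder is a maximiser.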


This then leads to our main result. 

Theorem 3 A rank-sum scoring function $s$ of the form in (4) is strictly
$R^*$-proper if and only if $\preceq_{P\sigma}$, the preorder induced by
$E_P[\sigma_i(Y)]$, is an element of $R^*(P)$ for all $P \in \mathcal{P}$.

Proof By the linearity of expectation, we have that

$$E_P[s(Y, \preceq)] = E_P[g(Y)] + \sum_{i=1}^{n} E_P[\sigma_i(Y)] \rho_i(\preceq).$$

By Lemma 1 this can be maximised by any $\preceq$ contained in
$\preceq_{P\sigma}$. These are all elements of $R^*(P)$ if and only if
$\preceq_{P\sigma}$ itself is in $R^*(P)$. □

Consequently, the Wilcoxon-Mann-Whitney $u$ function is a strictly
$R^*$-proper scoring function; however, the same cannot be said of the AUC.

Example 3 Define the distribution $P$ on $(Y_1, Y_2, Y_3, Y_4)$ with the following
non-zero probabilities:

F(1,1,0,0) = |, P(0,0,1,0) = ^, P(0,0,0,1) = A. 

Then defining $\sigma$ as in Example 2 we have that

= and E[a(Y)] = (|,|,^, A) . 

Define $\preceq_p$ and $\preceq_\sigma$ as the preorders induced by $E[Y]$ and
$E[\sigma(Y)]$, respectively. Then $\rho(\preceq_p) = (2, 2, -1, -3)$ and
$\rho(\preceq_\sigma) = (0, 0, 3, -3)$, with expected AUCs

E1AUC(F, ;ip)| = § < ElAUC(y, A„)| = 


This rather contrived example is illustrative of how the problem arises, namely 
the denominator of a can alter the relative importance of certain outcomes. 
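A counter-example of this shape can be reproduced with exact arithmetic. In the sketch below, the three probabilities (1/2, 2/5, 1/10) are an assumption: they were chosen because they yield the rank vectors $(2, 2, -1, -3)$ and $(0, 0, 3, -3)$ stated in the example, and should not be read as the values used in the text. Function names are illustrative.

```python
from fractions import Fraction as F

def rank_vector(v):
    """rho_i = #{j : v_j <= v_i} - #{j : v_i <= v_j} for the preorder induced by v."""
    n = len(v)
    return [sum((v[j] <= v[i]) - (v[i] <= v[j]) for j in range(n))
            for i in range(n)]

def sigma(y):
    """sigma_i(y) = y_i / (n0 * n1), or 0 when all outcomes are identical."""
    n1 = sum(y)
    n0 = len(y) - n1
    if n0 == 0 or n1 == 0:
        return [F(0)] * len(y)
    return [F(yi, n0 * n1) for yi in y]

def auc(y, v):
    """Empirical AUC, 1/2 + (1/2) * sum_i sigma_i(y) * rho_i."""
    rho = rank_vector(v)
    return F(1, 2) + F(1, 2) * sum(s * r for s, r in zip(sigma(y), rho))

def expectation(P, g):
    """Componentwise E_P[g(Y)] for a pmf P given as {outcome tuple: prob}."""
    n = len(next(iter(P)))
    return [sum(p * g(y)[i] for y, p in P.items()) for i in range(n)]

def expected_auc(P, v):
    return sum(p * auc(y, v) for y, p in P.items())

# ASSUMED probabilities (not from the text); they reproduce the stated ranks.
P = {(1, 1, 0, 0): F(1, 2), (0, 0, 1, 0): F(2, 5), (0, 0, 0, 1): F(1, 10)}

mean_y = expectation(P, lambda y: y)   # true marginal probabilities
mean_sigma = expectation(P, sigma)     # E[sigma(Y)]
honest = expected_auc(P, mean_y)       # ranking by the true marginals
shifted = expected_auc(P, mean_sigma)  # ranking by E[sigma(Y)]
```

Under these assumed probabilities the honest ranking has expected AUC 19/30, while the ranking induced by $E[\sigma(Y)]$ attains 13/20, so quoting the true marginals is not optimal.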

Nevertheless, there exist certain families P under which AUC is indeed 
proper. 

Theorem 4 If the number of positive outcomes $n_1(Y)$ is almost surely
constant for all $P \in \mathcal{P}$, then AUC is a strictly $R^*$-proper
scoring function.

Proof If $n_1(Y) = r$ almost surely, then
$E_P[\sigma_i(Y)] = E_P[Y_i] / ((n - r) r)$. □

This justifies the use of AUC as a scoring function in cases where the
forecaster is informed of the number of positive outcomes beforehand. This
means that the forecaster is able to use this information to rule out extreme
tail events that might otherwise have provided a windfall score. For example,
in the IJCNN Social Network Challenge by Kaggle
(https://www.kaggle.com/c/socialNetwork),
competitors were required to estimate 8960 binary outcomes (corresponding
to presence/absence of an edge), of which they were informed that exactly
half were positive.

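Theorem 5 If the $Y_i$'s are mutually independent under all
$P \in \mathcal{P}$, then AUC is a strictly $R^*$-proper scoring function.

Proof Note that if $y_i \ne y_j$, then $n_1(y) = 1 + n_1^{(-i,-j)}(y)$, where
$n_1^{(-i,-j)}(y) = \sum_{k \ne i, j} y_k$, and similarly for $n_0$. Then

$$\sigma_i(y) - \sigma_j(y) = \frac{y_i - y_j}{n_0(y) n_1(y)}
= \frac{y_i - y_j}{[1 + n_0^{(-i,-j)}(y)][1 + n_1^{(-i,-j)}(y)]},$$

since if $y_i = y_j$ the numerator is zero. Then by mutual independence,

$$E[\sigma_i(Y)] - E[\sigma_j(Y)] = (E[Y_i] - E[Y_j]) \, E\left[ \frac{1}{[1 + n_0^{(-i,-j)}(Y)][1 + n_1^{(-i,-j)}(Y)]} \right].$$

As the latter expectation is strictly positive, it follows that
$E[\sigma_i(Y)] \le E[\sigma_j(Y)]$ if and only if $E[Y_i] \le E[Y_j]$. □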
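The conclusion of Theorem 5 can be checked exactly for a small independent model. The sketch below assumes hypothetical success probabilities (including one tie) and uses exact fractions; it compares the preorder induced by $E[\sigma(Y)]$ with the one induced by $E[Y]$. Names are illustrative.

```python
from fractions import Fraction as F
import itertools

def rank_vector(v):
    """rho_i = #{j : v_j <= v_i} - #{j : v_i <= v_j} for the preorder induced by v."""
    n = len(v)
    return [sum((v[j] <= v[i]) - (v[i] <= v[j]) for j in range(n))
            for i in range(n)]

def sigma(y):
    """sigma_i(y) = y_i / (n0 * n1), or 0 when all outcomes are identical."""
    n1 = sum(y)
    n0 = len(y) - n1
    if n0 == 0 or n1 == 0:
        return [F(0)] * len(y)
    return [F(yi, n0 * n1) for yi in y]

def expected_sigma(p):
    """E[sigma(Y)] when Y_i ~ Bernoulli(p_i) independently."""
    n = len(p)
    total = [F(0)] * n
    for y in itertools.product((0, 1), repeat=n):
        prob = F(1)
        for yi, pi in zip(y, p):
            prob *= pi if yi == 1 else 1 - pi
        for i, s in enumerate(sigma(y)):
            total[i] += prob * s
    return total

# Hypothetical independent marginals, with a tie between the middle two.
p = (F(1, 5), F(1, 2), F(1, 2), F(9, 10))
```

Here `rank_vector(expected_sigma(p))` and `rank_vector(p)` coincide, ties included, as the proof predicts.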

As noted in section 3.1, mutual independence is a somewhat unrealistic
condition for scoring functions. Nevertheless, it can be useful when combined
with the following result.

Theorem 6 Let $\mathcal{P}$ consist of distributions $P$ such that there is a
latent variable $Z$ whereby

(i) for almost all $Z$, $E_P[Y \mid Z]$ induces the same preordering as
$E_P[\sigma(Y) \mid Z]$, and

(ii) this preordering is the same for almost all $Z$,

then AUC is a strictly proper scoring function for $R^*$.

Proof Condition (i) implies that

$$E_P[Y_i - Y_j \mid Z] \ge 0 \iff E_P[\sigma_i(Y) - \sigma_j(Y) \mid Z] \ge 0,$$

and by condition (ii) then,

$$E_P[E[Y_i - Y_j \mid Z]] = E_P[Y_i - Y_j] \ge 0 \iff E_P[\sigma_i(Y) - \sigma_j(Y)] \ge 0. \quad □$$

This provides a means for showing AUC is proper in more general contexts,
by combining it with one of the previous two theorems to satisfy condition
(i). For example, if $\theta$ is a parameter in a Bayesian model, conditional on
which the outcomes are independent (e.g. a logistic regression model), then
AUC is proper for the predictive distributions if (ii) holds.

However, these conditions can fail if there is significant uncertainty in the
ordering of the outcomes, which may arise in problems such as out-of-sample
prediction.

Example 4 Suppose that there are two candidate models, A and B, each
weighted with probability 1/2, and the forecaster is to rank 100 outcomes,
of which 10 have a particular feature $U$ present. Suppose that the forecast
probabilities are

$$E[Y_i \mid U_i, A] = 0.4, \qquad E[Y_i \mid \neg U_i, A] = 0.5,$$
$$E[Y_i \mid U_i, B] = 0.95, \qquad E[Y_i \mid \neg U_i, B] = 0.9,$$

and that outcomes are independent within each model. Then the resulting
marginal probabilities are

$$E[Y_i \mid U_i] = 0.675, \qquad E[Y_i \mid \neg U_i] = 0.7.$$

However, using the induced ranking will result in an expected AUC of 0.496,
whereas the opposite ranking will give an expected AUC of 0.504 (see
supplementary material).
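The expected AUCs in Example 4 can be computed by summing over the number of positives in each group. The sketch below is one possible way to do so and is not the supplementary material itself: it assumes all outcomes within a group are tied in the ranking, and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[Binomial(n, p) = k]."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def expected_auc_two_groups(n_u, n_rest, p_u, p_rest):
    """Expected AUC when every feature-absent outcome is ranked above every
    feature-present one, outcomes are tied within each group, and all
    outcomes are independent with the given success probabilities."""
    total = 0.0
    for a in range(n_u + 1):            # positives with the feature
        for b in range(n_rest + 1):     # positives without the feature
            prob = binom_pmf(a, n_u, p_u) * binom_pmf(b, n_rest, p_rest)
            n1 = a + b
            n0 = n_u + n_rest - n1
            if n0 == 0 or n1 == 0:
                auc = 0.5               # convention for identical outcomes
            else:
                # Pairs (negative, positive) with the negative ranked lower,
                # within-group ties counting one half.
                u = ((n_u - a) * b
                     + 0.5 * (n_u - a) * a
                     + 0.5 * (n_rest - b) * b)
                auc = u / (n0 * n1)
            total += prob * auc
    return total

# (p given feature, p given no feature) under each equally weighted model.
models = {"A": (0.4, 0.5), "B": (0.95, 0.9)}

# Induced ranking: feature-absent outcomes (marginal 0.7) above
# feature-present ones (marginal 0.675).
induced = sum(0.5 * expected_auc_two_groups(10, 90, p_u, p_rest)
              for p_u, p_rest in models.values())

# Reversing a ranking maps AUC to 1 - AUC (and 1/2 to 1/2 when degenerate).
opposite = 1.0 - induced
```

The reversal identity in the last line is what makes the two expected AUCs in the example sum to one.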

4 Scoring functions for mappings 

In many forecasting settings, each variable $Y_i$ has a corresponding covariate
or feature $X_i$ taking values in some measurable space $\mathcal{X}_\bullet$,
which can be used to inform the prediction. In the case where the forecaster
is able to observe the covariates directly, we can assume any relevant
information is taken into account, and thus no additional consideration is
required.

However, we can also consider the setting in which the forecaster does not
observe the covariates, but is instead required to provide some sort of
mapping from the covariate space $\mathcal{X} = (\mathcal{X}_\bullet)^n$ to the
original prediction space $\mathcal{Z}$ for $Y$ (we use the term mapping so as
to distinguish from scoring functions). In other words, the forecaster is
required to make a prediction in the mapping prediction space

$$\tilde{\mathcal{Z}} = \{ f : \mathcal{X} \to \mathcal{Z} \}.$$

Furthermore, any scoring function
$s : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$ has a corresponding
mapping form
$\tilde{s} : (\mathcal{X} \times \mathcal{Y}) \times \tilde{\mathcal{Z}} \to \mathbb{R}$,
which is simply $s$ evaluated using the mapping applied to the observed
covariates,

$$\tilde{s}((x, y), f) = s(y, f(x)).$$

Similarly, given any statistical functional $T : \mathcal{P} \to \mathcal{Z}$,
we can define the corresponding mapping functional
$\tilde{T} : \mathcal{P}_{XY} \to \tilde{\mathcal{Z}}$, the mapping of the
conditional expectation

$$\tilde{T}(P_{XY})(x) = T(P_{Y \mid X = x}),$$

where $P_{Y \mid X = x}$ denotes the conditional distribution of $Y$ given
$X = x$ under $P$. That is, the optimal mapping should map each
$x \in \mathcal{X}$ to the optimal prediction under the conditional
distribution $P_{Y \mid X = x}$.

Theorem 7 Let $s$ be a $T$-proper scoring function for a family $\mathcal{P}$,
then $\tilde{s}$ is a $\tilde{T}$-proper scoring function for
$\mathcal{P}_{XY}$ if for each $P_{XY} \in \mathcal{P}_{XY}$, there exists a
family of conditional distributions $\{P_{Y \mid X = x}\}_x$ which is a subset
of $\mathcal{P}$.

Proof The expected mapping score is

$$E[\tilde{s}((X, Y), f)] = E[E[s(Y, f(X)) \mid X]].$$

The inner expectation can be maximised for each value of $x \in \mathcal{X}$ by
choosing $f(x) = \arg\max_z E[s(Y, z) \mid X = x]$, which, as $s$ is
$T$-proper, will be (an element of) $T(P_{Y \mid X = x})$. □


However, we typically don't want to consider all possible mappings
$f : \mathcal{X} \to \mathcal{Z}$. Instead, we typically are only interested in
mappings that can be applied coordinate-wise,

$$f(x) = (f_\bullet(x_1), \ldots, f_\bullet(x_n)), \quad \text{where } f_\bullet : \mathcal{X}_\bullet \to \mathbb{R}.$$

In other words, we constrain the mapping such that the forecast for each
$Y_i$ depends only on its corresponding covariate $X_i$, and require that this
mapping be the same for all $i$. Of course, we also need to constrain the
family of distributions to ensure that the marginal mapping is
coordinate-wise.
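The mapping form of a scoring function and a coordinate-wise mapping are easy to express in code. The sketch below is illustrative: `mapping_form` and `coordinatewise` are hypothetical helper names, and the logistic map and the data are assumptions chosen only to show the mechanics.

```python
import math

def auc(y, v):
    """Empirical AUC of values v against binary outcomes y (1/2 if degenerate)."""
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mapping_form(s):
    """Turn a scoring function s(y, z) into s~((x, y), f) = s(y, f(x))."""
    def s_tilde(x, y, f):
        return s(y, f(x))
    return s_tilde

def coordinatewise(f_single):
    """Apply the same single-covariate map to every coordinate of x."""
    return lambda x: [f_single(xi) for xi in x]

def logistic(t):
    """Hypothetical single-covariate mapping (assumption)."""
    return 1.0 / (1.0 + math.exp(-t))

auc_tilde = mapping_form(auc)
x = [-1.0, 0.5, 2.0, 0.0]   # covariates, observed only at scoring time
y = [0, 1, 1, 0]            # outcomes
score = auc_tilde(x, y, coordinatewise(logistic))
```

Because the AUC depends on the mapping only through the induced preorder, any strictly increasing coordinate-wise map of the same covariate gives the same score.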

Theorem 8 Let $\mathcal{P}_{XY}$ be the set of distributions for $(X, Y)$ such that

(i) $Y_i$ are conditionally independent of $X$ given $X_i$, and

(ii) the distribution of $Y_i \mid X_i$ is the same for all $i$.

Then for any $M$-proper scoring function $s$ for a family $\mathcal{P}$,
$\tilde{s}$ is an $\tilde{M}$-proper scoring function for the set of
coordinate-wise mappings if the conditional distributions $P_{Y \mid X = x}$
are in $\mathcal{P}$.

Proof By (i) we have that $E[Y_i \mid X = x] = E[Y_i \mid X_i = x_i]$, and by
(ii) it follows that this quantity is the same function of $x_i$ for all $i$.
Therefore the mapping $f(x) = M(P_{Y \mid X = x})$ is coordinate-wise, which by
Theorem 7 implies that $\tilde{s}$ is $\tilde{M}$-proper. □

Consequently $\tilde{u}$, the mapping form of $u$, is $\tilde{M}$-proper for any
$\mathcal{P}_{XY}$ satisfying (i) and (ii). For AUC to be $\tilde{M}$-proper,
additional conditions are required, such as mutual independence of elements of
$Y$ conditional on $X$.

5 Discussion 

Although we have demonstrated that AUC is not generally a proper scoring
function, Examples 3 and 4 both exhibit quite extreme dependence between
outcomes. Therefore, it might be possible to establish more relaxed criteria
for establishing propriety of AUC, for example, bounds on correlation or other
measures of dependence.

We have also only considered the batch prediction setting where the
forecaster is required to provide the preordering for all $Y$ before any
outcomes have been observed. One alternative is a sequential framework, where
at each point in time the forecaster is required to provide a forecast for
$Y_{t+1}$, having already observed $Y_1, \ldots, Y_t$. In the ranking case, this
requires the forecaster to provide a total preorder $\preceq_{t+1}$ on
$I_{t+1}$ that is compatible with the one provided on $I_t$. Unfortunately,
rank-sum scoring functions are essentially useless in this setting.

Example 5 Let $s$ be any rank-sum scoring rule of the form in (4), where
$\sigma_i(y) = \sigma_j(y)$ if $y_i = y_j$, and $\sigma_i(y) > \sigma_j(y)$ if
$y_i > y_j$ (both $u$ and the AUC satisfy this property). Then in the
sequential setting, it is possible to maintain an optimal score by choosing
$\preceq_{t+1}$ such that

$$i \prec_{t+1} t + 1 \prec_{t+1} j \quad \text{for all } i, j \le t \text{ with } Y_i = 0 \text{ and } Y_j = 1.$$

By a straightforward application of induction, it is easy to see that such a
sequence exists, and that it will maintain this “perfect separation”, in that
all $i$ where $Y_i = 1$ will always be ranked above all $j$ where $Y_j = 0$.
Therefore, by Lemma 1 this will result in the largest possible score (i.e. an
AUC of 1): note that unlike the previous sections, we refer to the actual
score, not just the expected score.
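The construction in Example 5 can be simulated directly. The sketch below is one possible implementation under an assumed encoding: each item receives a numeric value before its outcome is revealed, placed strictly between the largest value of any observed 0 and the smallest value of any observed 1. Exact fractions are used because repeated bisection would exhaust floating-point precision. Function names are illustrative.

```python
from fractions import Fraction
import random

def sequential_values(outcomes):
    """Assign each item a value BEFORE seeing its outcome, keeping every
    observed 0 strictly below and every observed 1 strictly above it."""
    lo, hi = Fraction(0), Fraction(1)   # open interval separating 0s from 1s
    values = []
    for y in outcomes:
        s = (lo + hi) / 2               # chosen without knowledge of y
        values.append(s)
        if y == 1:
            hi = s                      # s becomes the lowest-ranked positive
        else:
            lo = s                      # s becomes the highest-ranked negative
    return values

def auc(y, v):
    """Empirical AUC of values v against binary outcomes y (1/2 if degenerate)."""
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)                  # arbitrary outcome sequence
outcomes = [rng.randint(0, 1) for _ in range(200)]
values = sequential_values(outcomes)
```

Earlier values are never revised, so each preorder is compatible with the previous one, yet the final ranking separates the outcomes perfectly whatever sequence is supplied.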

In other words, it is possible to construct an optimal procedure with
absolutely no information whatsoever about the process of $Y_i$. This problem
will persist in the analogous mapping problem, where the forecaster is free
to choose the mapping $f_t : \mathcal{X}_\bullet \to \mathbb{R}$ at each
iteration.